Create and Run Your Own AI Powered Chat Application

Introduction

AI models are constantly evolving, and the competition for the best model is fierce. Many of the models available today are powerful enough to drive real productivity gains for you or your organization, so there is little reason to stay locked into a single vendor’s proprietary AI chat interface. The biggest productivity gains are realized when the AI model has access to your unique data - memory, documents, source code, etc. That information is proprietary and valuable, and you don’t want it locked into a single AI provider.

The solution is to build your own AI powered applications. I know this sounds daunting, but it really isn’t. You can start with an existing open-source solution - like Open WebUI - to dip your toes in. You can also build your own application from scratch, and to prove it I’m going to show you how in this article. I’ll also show you how to deploy the “logic layer” of the application on a low-cost compute rental from my own platform, CitadelHosts.com. You don’t need to deploy on CitadelHosts.com for this to be useful - you can follow along and run the application locally or on whatever cloud you prefer - but CitadelHosts.com was purpose built for deployments just like this! Our docs site has a quick start guide that inspired this article. If you want to skip the details and get straight to the code and deployment, I recommend checking it out.

Our AI Brain, Inference In Python

At the core of our application is going to be our AI model. We need an inference provider. This is the piece that makes our application an AI enabled application.

Running inference - sending requests to an AI model - is really easy thanks to the OpenAI API standard. OpenAI maintains a Python SDK called openai that lets you quickly interact with any compatible API. Just because it is called the OpenAI API doesn’t mean we can only interact with OpenAI: Google, Anthropic, and many other inference providers offer OpenAI-compatible endpoints.

For example, you can interact with Google Gemini using the following code snippet (replace GEMINI_API_KEY with your own key):

from openai import OpenAI

client = OpenAI(
    api_key="GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[
        {   "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Explain to me how AI works"
        }
    ]
)

print(response.choices[0].message)
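
A key pasted directly into source code is easy to leak, so one option is to read it from the environment and keep the per-provider settings in one place. The sketch below is only an illustration under a few assumptions: PROVIDERS and client_kwargs are hypothetical names, the Gemini URL is the one used above, and the Ollama URL assumes a default local install listening on port 11434.

```python
import os

# Hypothetical registry of OpenAI-compatible endpoints (names are illustrative).
PROVIDERS = {
    "gemini": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
        "key_env": "GEMINI_API_KEY",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",  # assumes a default local install
        "key_env": None,  # local server, no real key needed
    },
}


def client_kwargs(provider: str) -> dict:
    """Return the base_url/api_key pair to pass to OpenAI(...)."""
    cfg = PROVIDERS[provider]
    if cfg["key_env"] is None:
        # Placeholder value for servers that do not check the key.
        api_key = "sk-no-key-required"
    else:
        api_key = os.environ.get(cfg["key_env"], "")
        if not api_key:
            raise RuntimeError(f"Set the {cfg['key_env']} environment variable first")
    return {"base_url": cfg["base_url"], "api_key": api_key}
```

With this in place, the client from the snippet above could be created as OpenAI(**client_kwargs("gemini")), and switching providers becomes a one-word change.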

For our application, we are going to support streaming responses. Streaming lets the frontend display updates to the user as the model generates tokens. Without streaming, the user has to wait for the model to generate the full response, which does not make for a very responsive-feeling application.

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[
        {   "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Explain to me how AI works"
        }
    ],
    stream=True
)

for chunk in response:
    delta = chunk.choices[0].delta
    # New text tokens arrive here
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end
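
If you also need the complete reply once streaming finishes (for example, to append it to the conversation history), the deltas can be joined as they arrive. The following is a minimal sketch: collect_text and fake_chunk are hypothetical helpers, and the stand-in objects only imitate the shape of the chunks used in the loop above so the example runs without an API key.

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # A provider may send chunks with an empty choices list; skip those.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty strings
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk, for local testing."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]
print(collect_text(demo))  # prints: Hello
```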

Backend Application

Now that we have our inference code, we need to build the routing application around it. For this example, I have chosen FastAPI as the framework. FastAPI is really useful for spinning up APIs, well, fast! It is a Python framework designed specifically for quickly creating backend logic, and since our inference SDK is also Python, it all fits together very easily.

Let’s look at the complete backend application in main.py and break down how it works.

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
from typing import List, Dict

app = FastAPI(title="Custom LLM Gateway")

# Allow CORS for local Svelte development
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173", "http://localhost:8080"], # Svelte default ports
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class ChatRequest(BaseModel):
    base_url: str
    api_key: str
    model: str
    messages: List[Dict[str, str]]

@app.post("/chat")
async def chat_endpoint(request: ChatRequest):
    try:
        # Initialize the OpenAI client with the user's custom config
        client = AsyncOpenAI(
            base_url=request.base_url,
            # Provide a dummy key if empty, as some local endpoints require a non-empty string
            api_key=request.api_key if request.api_key else "sk-no-key-required" 
        )

        # Create streaming response
        async def generate_stream():
            try:
                stream = await client.chat.completions.create(
                    model=request.model,
                    messages=request.messages,
                    stream=True
                )

                async for chunk in stream:
                    # Skip chunks that carry no choices (e.g. usage-only chunks)
                    if not chunk.choices:
                        continue
                    content = chunk.choices[0].delta.content
                    if content is not None:
                        yield content
            except Exception as e:
                yield f"Error: {str(e)}"

        return StreamingResponse(
            generate_stream(),
            media_type="text/plain"
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)

Explanation:

Imports and App Setup: We import necessary modules from FastAPI, Pydantic, and the OpenAI SDK. We create a FastAPI instance and configure CORS middleware to allow requests from our Svelte frontend (typically running on ports 5173 or 8080). On CitadelHosts, we will be putting the backend application on the same network as our Svelte frontend, so this will also allow the frontend -> backend communication when deployed.
Request Model: The ChatRequest Pydantic model defines the expected structure of incoming POST requests to the /chat endpoint. It requires:
- base_url: The endpoint of the inference API (compatible with OpenAI API).
- api_key: The authentication key for the inference service.
- model: The model identifier to use for generation.
- messages: A list of message objects, each with a role and content, following the OpenAI chat format.
Chat Endpoint: The /chat endpoint is an asynchronous POST route that:
- Instantiates an AsyncOpenAI client using the provided base_url and api_key. If no API key is provided, it uses a dummy key (some local endpoints like Ollama require a non-empty string).
- Defines an inner async generator function generate_stream that:
  - Calls client.chat.completions.create with stream=True to get a streaming response from the model.
  - Iterates over the stream, yielding each token’s content as it becomes available.
  - Catches any exceptions during streaming and yields an error message.
- Returns a StreamingResponse that wraps the generator, setting the media type to text/plain so the frontend receives raw text chunks.
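- Catches any exceptions raised while setting up the client and returns a 500 error. Failures that happen once the request is actually sent (e.g., an unreachable base URL) typically surface inside generate_stream and are reported in the streamed text instead.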
Running the Server: When the script is executed directly, it starts a Uvicorn development server on 0.0.0.0:8000 with auto-reload enabled, making it easy to iterate during development.
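
To make the expected shape of the messages field concrete, here is a small plain-Python check of the kind of payload described above. It is a sketch rather than part of the application: validate_messages is a hypothetical helper, and the set of allowed roles is an assumption based on the three roles this article uses.

```python
# Roles used in this article; other APIs may accept more (assumption).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages) -> list:
    """Return a list of problems found in a chat messages payload (empty if OK)."""
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has an unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} is missing string content")
    return problems


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain to me how AI works"},
]
print(validate_messages(good))  # prints: []
```

A check like this could run before the request is forwarded, so a malformed payload is rejected by the gateway instead of by the upstream provider.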

This backend acts as a flexible proxy: it forwards the user’s choice of inference endpoint, model, and messages to any OpenAI-compatible API, and streams the response back to the frontend. This design allows you to swap between different providers (like Google Gemini, Anthropic, or a local Ollama instance) without changing the frontend code.
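
Before wiring up a frontend, it can help to exercise the endpoint from a small script. The sketch below uses only the Python standard library. build_chat_request and stream_chat are hypothetical helper names, and the URLs in the usage comment assume the backend is running locally on port 8000 with a local Ollama instance as the provider.

```python
import codecs
import json
import urllib.request


def build_chat_request(backend_url, base_url, api_key, model, messages):
    """Build the POST request with the JSON body that the /chat endpoint expects."""
    payload = {
        "base_url": base_url,
        "api_key": api_key,
        "model": model,
        "messages": messages,
    }
    return urllib.request.Request(
        backend_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_chat(request, max_bytes=1024):
    """Yield decoded text pieces as the backend streams them."""
    # The incremental decoder copes with characters split across reads.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with urllib.request.urlopen(request) as response:
        while True:
            data = response.read1(max_bytes)  # returns as soon as data is available
            if not data:
                break
            yield decoder.decode(data)


# Example usage (assumed URLs and model name, requires a running backend):
# req = build_chat_request(
#     "http://localhost:8000/chat",
#     "http://localhost:11434/v1", "", "some-model",
#     [{"role": "user", "content": "Hello"}],
# )
# for piece in stream_chat(req):
#     print(piece, end="", flush=True)
```

Note that a failure during generation arrives as ordinary streamed text beginning with "Error:", because the response has already started by the time the exception is raised.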

With the backend in place, we can now turn our attention to the frontend application built with Svelte, which will interact with this endpoint to provide a seamless chat experience.

Frontend Application

For the frontend, I have chosen Svelte as the development framework for largely the same reason I chose FastAPI for the backend - it’s really fast to work in. The frontend has just a few responsibilities:

Configuration: Allowing the user to specify the inference endpoint, API key, and model.
Chat State: Managing the conversation history, user input, and loading status.
Communication: Sending messages to the backend and processing the streaming response.

Let’s break down the key parts of the frontend code:

// Configuration State
let baseUrl = "http://127.0.0.1:8080/v1"; // Default (e.g., a local llama.cpp server)
let apiKey = "";
let model = "qwen3.5-0.8b.gguf";

// Chat State
let messages = [
  { role: "system", content: "You are a helpful AI assistant." }
];
let userInput = "";
let isLoading = false;

State Management:

baseUrl, apiKey, and model store the user’s inference configuration.
messages holds the conversation history, starting with a system message.
userInput binds to the text input field.
isLoading tracks when a request is in progress to disable the UI.

async function sendMessage() {
    if (!userInput.trim()) return;

    // Add user message to UI
    messages = [...messages, { role: "user", content: userInput }];
    const currentInput = userInput;
    userInput = "";
    isLoading = true;

    try {
      const response = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          base_url: baseUrl,
          api_key: apiKey,
          model: model,
          messages: messages
        })
      });

      if (!response.ok) {
        const errorData = await response.json();
        throw new Error(errorData.detail || "Failed to fetch response");
      }

      // Initialize assistant message
      messages = [...messages, { role: "assistant", content: "" }];
      
      // Process streaming response
      const reader = response.body.getReader();
      const decoder = new TextDecoder("utf-8");
      let accumulatedContent = "";
      
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        
        // stream: true keeps multi-byte characters intact across chunk boundaries
        const chunk = decoder.decode(value, { stream: true });
        accumulatedContent += chunk;
        
        // Update the last message (assistant) with accumulated content
        messages = [
          ...messages.slice(0, messages.length - 1),
          { role: "assistant", content: accumulatedContent }
        ];
      }
      
    } catch (error) {
      console.error("Error:", error);
      messages = [...messages, { role: "system", content: `Error: ${error.message}` }];
    } finally {
      isLoading = false;
    }
  }

Sending Messages: When the user clicks “Send” or presses Enter, the sendMessage function:

Validates Input: Returns early if the input is empty.
Updates UI Immediately: Adds the user’s message to the chat history and clears the input field, providing instant feedback.
Sets Loading State: Prevents further requests while waiting for a response.
Makes Request to Backend: Sends a POST request to /api/chat (relative to the frontend origin) with the configuration and full message history.

Note: In development, this request is typically proxied to the actual FastAPI backend running on a different port. In production, the frontend and backend can be deployed behind a reverse proxy that routes /api/* to the backend service.
Handles Streaming Response:
- Initializes an empty assistant message in the chat history.
- Reads the response body as a stream using the Fetch API’s getReader().
- Decodes each chunk of bytes as UTF-8 text and accumulates the content.
- After each chunk, updates the last message (the assistant’s) with the accumulated content, causing the UI to update in real-time as tokens arrive.
Error Handling: Catches any network or server errors and adds a system message with the error details.
Cleanup: Resets the loading state in a finally block to re-enable the UI.
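
One subtlety in the decoding step is that a network chunk can end in the middle of a multi-byte UTF-8 character. The Python sketch below illustrates the problem and the usual fix, an incremental decoder that remembers the unfinished bytes between calls. In the browser, TextDecoder offers the same behaviour through the stream: true option of decode(). The function names are made up for this illustration.

```python
import codecs

text = "café ☕"
data = text.encode("utf-8")
# Cut the byte string in the middle of the two-byte "é" character.
chunks = [data[:4], data[4:]]


def decode_naive(byte_chunks):
    """Decode each chunk on its own; a split character becomes U+FFFD."""
    return "".join(c.decode("utf-8", errors="replace") for c in byte_chunks)


def decode_incremental(byte_chunks):
    """Decode with a stateful decoder that buffers incomplete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(c) for c in byte_chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any remaining state
    return "".join(pieces)


print(decode_incremental(chunks) == text)  # prints: True
print("\ufffd" in decode_naive(chunks))    # prints: True
```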

<div class="messages">
    {#each messages as message}
    {#if message.role !== 'system' || message.content.startsWith('Error')}
        <div class="message {message.role}">
        <strong>{message.role === 'user' ? 'You' : 'AI'}:</strong>
        <p>{message.content}</p>
        </div>
    {/if}
    {/each}
    {#if isLoading}
    <div class="message assistant"><p><em>Thinking...</em></p></div>
    {/if}
</div>

Displaying Messages: The template iterates over the messages array and displays each message. System messages are only shown if they contain an error (to avoid displaying the initial system prompt). User messages are aligned to the right, and assistant messages to the left, with distinct background colors.

Reactivity: Svelte’s reactivity ensures that whenever the messages array is updated (via the spread operator syntax like messages = [...messages, newMessage]), the DOM updates automatically to reflect the new state.

This frontend provides a clean, responsive chat interface that:

- Lets users configure any OpenAI-compatible inference endpoint
- Streams responses token-by-token for a smooth user experience
- Handles errors gracefully
- Maintains conversation history

The major components for our frontend have been outlined in this section but there are some missing pieces for building a full Svelte application. I invite you to view the GitHub repo to see the full frontend folder. The file that we were building is located at https://github.com/LiamJFitzpatrick/citadel-fastapi-svelte-example/blob/main/frontend/src/routes/%2Bpage.svelte.

With both backend and frontend in place, you now have a complete, flexible AI chat application that can work with any provider supporting the OpenAI API standard. You can deploy this to CitadelHosts.com or any other hosting platform, and easily swap between different models and services without changing the code.

Running and Deploying

In order to use the application, you can run it locally with a dev server. But if you want to be able to deploy it on CitadelHosts.com or with your own resources, it would be better to package up the application using Docker images.

For our deployment we will be creating a Dockerfile for the backend and a separate Dockerfile for the frontend.

The backend file starts from a Python base image, installs the dependencies for our FastAPI project, and then copies over the main.py file that we created. Finally, it sets uvicorn as the startup command, launching the application on port 8000. The file is below.

FROM python:3.11-slim-bookworm

WORKDIR /app

COPY ./requirements.txt ./

# Install the Python dependencies listed in requirements.txt
RUN pip install -r requirements.txt

COPY ./main.py ./

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The frontend file builds the Svelte application using the static adapter. It then copies the resulting files into an Nginx base image. Nginx is configured to serve the static files at the base URL and forward any requests for /api/* to our backend application. The Dockerfile and nginx.conf for the frontend are below.

FROM docker.io/library/node:20-alpine AS build

WORKDIR /app

COPY package*.json ./

RUN npm install

COPY . ./

RUN npm run build

FROM docker.io/library/nginx:1.25-alpine

# No trailing slash here: nginx.conf adds one when it builds proxy_pass
ENV BACKEND_URL=http://host.containers.internal:8000
COPY nginx.conf /etc/nginx/templates/default.conf.template
COPY --from=build /app/build /usr/share/nginx/html

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]

server {
  listen 80;
  server_name _;

  root /usr/share/nginx/html;
  index index.html;

  location /api/ {
    proxy_pass ${BACKEND_URL}/;  # trailing slash is the key
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 900s;
    proxy_buffering off;  # pass streamed chunks to the browser as they arrive
  }

  location / {
    try_files $uri /index.html;
  }
}
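
The path handling in the /api/ location is easy to get wrong, so here is a small Python model of the documented nginx rule for prefix locations: when proxy_pass includes a URI part (even a bare "/"), the portion of the request path that matched the location is replaced by that URI, and when it has no URI part the path is forwarded unchanged. upstream_path is a hypothetical helper written for this illustration, not something nginx provides.

```python
def upstream_path(request_path: str, location: str, proxy_pass_uri: str) -> str:
    """Mimic how nginx maps a request path for a prefix location.

    proxy_pass_uri is the path portion of the proxy_pass URL: "/" for
    "http://backend:8000/", or "" for "http://backend:8000".
    """
    if not request_path.startswith(location):
        raise ValueError("request does not match this location")
    if proxy_pass_uri == "":
        # No URI part: the original path is passed through as-is.
        return request_path
    # URI part present: it replaces the matched location prefix.
    return proxy_pass_uri + request_path[len(location):]


print(upstream_path("/api/chat", "/api/", "/"))  # prints: /chat
print(upstream_path("/api/chat", "/api/", ""))   # prints: /api/chat
```

This is why the frontend can call /api/chat while the FastAPI route is simply /chat.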

Now you can package your application and deploy it using Docker. For the purposes of this article, I will refer you to this guide for further reading on deploying this application.

Conclusion

There are many options to quickly build your own AI powered applications. Anyone can use the above guide as a starting point for their own chat interface - it can be expanded to support more features and tailored to the exact use case you have in mind. Going this route allows you to avoid vendor lock-in and keep the important things - like your data - inside your own application instead of on another company’s servers. The full repository for this application can be found on GitHub at LiamJFitzpatrick/citadel-fastapi-svelte-example.

I hope this article was useful. If something was unclear and needs more explanation, feel free to ask. I was trying to focus on the concept of building an AI powered chat interface without diving too deep into the supporting tools around it. This application is admittedly very basic, but it has the bones to be expanded. I would be happy to expand on it in another article if anyone is interested.